AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that helps the marketing department identify the potential customers who have a higher probability of purchasing the loan.
The objectives are to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segments of customers to target.
Data Dictionary:
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIP Code: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
# Installing the libraries with the specified versions.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.
On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.
# Libraries to help with reading and manipulating data.
import pandas as pd
import numpy as np
# Libraries to help with data visualization.
import matplotlib.pyplot as plt
import seaborn as sns
# Additional matplotlib helpers for annotations and axis formatting.
import matplotlib.patches as mpatches
from matplotlib.ticker import FuncFormatter
# Removes the limit for the number of displayed columns.
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows.
pd.set_option("display.max_rows", 200)
# Library to split data.
from sklearn.model_selection import train_test_split
# To build model for prediction.
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models.
from sklearn.model_selection import GridSearchCV
# To get different metric scores.
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
make_scorer,
classification_report,
)
# Library to suppress warnings (FutureWarning).
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
# Read the Loan Modelling dataset from Google Drive.
from google.colab import drive
drive.mount('/content/drive')
# Assign the dataset to a DataFrame - personal_loan_df_original.
personal_loan_df_original = pd.read_csv('/content/drive/MyDrive/Loan_Modelling.csv')
# Create a copy of the original DataFrame to avoid modifying the original data.
personal_loan_df = personal_loan_df_original.copy()
# Return the first 5 rows of the dataset.
personal_loan_df.head(5)
# Return the last 5 rows of the dataset.
personal_loan_df.tail(5)
# Display the number of rows and columns in the dataset.
rows, columns = personal_loan_df.shape
# Print the number of rows and columns from the dataset.
print(f'Number of Rows: {rows:,}')
print(f'Number of Columns: {columns:,}')
# Remove any unnecessary columns, but only if they exist
if 'ID' in personal_loan_df.columns:
personal_loan_df.drop(["ID"], axis=1, inplace=True)
# Display summary information incl. the data types in the DataFrame.
personal_loan_df.info()
Observations:
Data Columns:
Data Types: a mix of integer (int64) and float (float64) columns
Variable Types:
Memory Usage:
# Check for missing/null values in the dataset.
missing_values = personal_loan_df.isnull().sum()
# Output if there are any missing data points in the dataset.
if missing_values.sum() > 0:
print('There are missing values in the dataset.')
else:
print('There are no missing data points in the Personal Loan dataset.')
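If the check above ever reports gaps, a per-column breakdown shows where they are. A minimal sketch on toy data (the column names here are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in column 'a'.
df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, 6]})

missing = df.isnull().sum()
# Keep only the columns that actually contain missing values.
print(missing[missing > 0].to_dict())  # {'a': 1}
```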
# Display the statistical summary of the dataset.
personal_loan_df.describe(include="all").T
Observations:
Age:
Family Size:
Income:
Personal Loan Acceptance:
Mortgage:
Experience:
# Checking for anomalous values
personal_loan_df["Experience"].unique()
# Checking for experience <0
personal_loan_df[personal_loan_df["Experience"] < 0]["Experience"].unique()
# Correcting the experience values
personal_loan_df["Experience"].replace(-1, 1, inplace=True)
personal_loan_df["Experience"].replace(-2, 2, inplace=True)
personal_loan_df["Experience"].replace(-3, 3, inplace=True)
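The three `replace` calls above fix the specific values -1, -2, and -3 one at a time. Since each correction is just the absolute value, a vectorized form would cover any negative entry that might appear in a refreshed extract; a sketch on toy data:

```python
import pandas as pd

# Toy column containing the same kind of negative-experience anomaly.
df = pd.DataFrame({"Experience": [5, -1, 12, -2, 0, -3]})

# abs() generalizes the pairwise -1 -> 1, -2 -> 2, -3 -> 3 replacements.
df["Experience"] = df["Experience"].abs()
print(df["Experience"].tolist())  # [5, 1, 12, 2, 0, 3]
```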
# Checking the number of unique ZIP codes
personal_loan_df["ZIPCode"].nunique()
There are 467 unique ZIP codes.
# Present the number of unique occurrences and the corresponding counts for each ZIP code
zip_code_counts = personal_loan_df['ZIPCode'].value_counts()
print("Unique Zip Codes and their Counts:")
zip_code_counts
Observations:
Top ZIP Codes:
Top 20 ZIP Codes Overview:
Geographical Concentration:
Potential Regional Bias:
Diverse Representation:
# Converting the data type of categorical features to 'category'
## We skip Age, Experience, CCAvg, Mortgage, Income, Family, and ZIP Code because they have many unique values
cat_cols = [
"Education",
"Personal_Loan",
"Securities_Account",
"CD_Account",
"Online",
"CreditCard",
]
personal_loan_df[cat_cols] = personal_loan_df[cat_cols].astype("category")
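Beyond making the semantics explicit, the `category` dtype stores each value as a small integer code plus a lookup table, which typically cuts memory for low-cardinality columns; a quick check on synthetic data:

```python
import numpy as np
import pandas as pd

# A low-cardinality integer column, similar in shape to Education (values 1-3).
s_int = pd.Series(np.random.default_rng(0).integers(1, 4, size=5_000))
s_cat = s_int.astype("category")

# The categorical copy holds small integer codes plus a 3-entry lookup table.
print(s_int.memory_usage(deep=True), s_cat.memory_usage(deep=True))
```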
# printing the number of occurrences of each unique value in each categorical column
for column in cat_cols:
print(personal_loan_df[column].value_counts())
print("-" * 50)
# Calculate the percentage of each unique value in the categorical columns
for column in cat_cols:
print(personal_loan_df[column].value_counts(normalize=True) * 100)
print("-" * 50)
Observations:
Education Level:
Loan Campaign Response:
Securities Account:
Certificate of Deposit (CD) Account:
Internet Banking Usage:
Credit Card Usage (Other Banks):
# Creating categories from Age, CC Avg, and Income to analyze the trend of borrowing Personal Loan
personal_loan_df["income_bin"] = pd.cut(
x=personal_loan_df["Income"],
bins=[0, 39, 98, 224],
labels=["Low", "Mid", "High"],
)
personal_loan_df["cc_spending_bin"] = pd.cut(
x=personal_loan_df["CCAvg"],
bins=[-0.0001, 0.7, 2.5, 10.0],
labels=["Low", "Mid", "High"],
)
personal_loan_df["age_bin"] = pd.cut(
x=personal_loan_df["Age"],
bins=[0, 35, 55, 67],
labels=["Young Adults", "Middle Aged", "Senior"],
)
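Note that `pd.cut` builds right-closed bins by default, so each interval is `(left, right]` and a value equal to the lowest edge falls outside it; a customer with Income exactly 0 would land in no bin (NaN) unless `include_lowest=True` is passed. A boundary check on toy values:

```python
import pandas as pd

incomes = pd.Series([0, 39, 40, 98, 99, 224])
bins = pd.cut(incomes, bins=[0, 39, 98, 224], labels=["Low", "Mid", "High"])

# 0 is outside (0, 39] and becomes NaN; 39 is "Low", 40 is "Mid", and so on.
print(bins.tolist())
```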
Questions:
Question 1 - What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
# Plot the distribution of mortgage using a histogram.
plt.figure(figsize=(12, 6))
sns.histplot(personal_loan_df['Mortgage'], bins=30, kde=True)
plt.title('Distribution of Mortgage Attribute')
plt.xlabel('Mortgage')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
# Plot the distribution of mortgage using a boxplot to identify outliers.
plt.figure(figsize=(12, 6))
sns.boxplot(x=personal_loan_df['Mortgage'])
plt.title('Boxplot of Mortgage Attribute')
plt.xlabel('Mortgage')
plt.grid(True)
plt.show()
# Plot the distribution of mortgage using a violinplot to identify outliers.
plt.figure(figsize=(12, 6))
sns.violinplot(x=personal_loan_df['Mortgage'])
plt.title('Violinplot of Mortgage Attribute')
plt.xlabel('Mortgage')
plt.grid(True)
plt.show()
# plot the mortgage cumulative density distribution
plt.figure(figsize=(10, 6))
sns.ecdfplot(personal_loan_df['Mortgage'], label='Mortgage')
plt.title('Cumulative Distribution Function of Mortgage')
plt.xlabel('Mortgage Amount')
plt.ylabel('Cumulative Density')
plt.legend()
plt.grid(True)
plt.show()
# Number of customers without mortgage.
no_mortgage_count = personal_loan_df[personal_loan_df['Mortgage'] == 0].shape[0]
# Total number of customers.
total_customers = personal_loan_df.shape[0]
# Percentage of customers without mortgage.
percentage_no_mortgage = (no_mortgage_count / total_customers) * 100
print(f"Number of customers without mortgage: {no_mortgage_count:,}")
print(f"Total number of customers: {total_customers:,}")
print(f'Percentage of customers without mortgage: {percentage_no_mortgage:.2f}%')
Observations:
Mortgage Distribution Characteristics:
Action Items Regarding Outliers:
These outliers should be validated and appropriately treated in subsequent analyses.
Marketing Recommendation:
Question 2 - How many customers have credit cards?
# Calculate the number of customers with credit cards from other banks
credit_card_customers = personal_loan_df[personal_loan_df['CreditCard'] == 1].shape[0]
# Calculate the total number of customers
total_customers = personal_loan_df.shape[0]
# Calculate the percentage of customers with credit cards from other banks
percentage_credit_card_customers = (credit_card_customers / total_customers) * 100
# Print the results
print(f"Number of customers with credit cards from other banks: {credit_card_customers}")
print(f"Total number of customers: {total_customers}")
print(f"Percentage of customers with credit cards from other banks: {percentage_credit_card_customers:.2f}%")
credit_card_counts = personal_loan_df['CreditCard'].value_counts()
# Create the pie chart
plt.figure(figsize=(8, 8)) # Adjust figure size as needed
plt.pie(credit_card_counts, labels=['No Credit Card', 'Has Credit Card'], autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Credit Card Usage')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle
plt.show()
Observations
Credit Card Ownership:
Marketing Recommendation:
Question 3 - What are the attributes that have a strong correlation with the target attribute (personal loan)?
# Calculate the correlation matrix, including categorical features as numeric
correlation_matrix = personal_loan_df.apply(lambda x: pd.factorize(x)[0]).corr() # Encode every column (including categoricals) as integer codes before correlating
# Filter for correlations with 'Personal_Loan'
personal_loan_correlations = correlation_matrix['Personal_Loan'].drop('Personal_Loan')
# Print the correlations
print("Correlation with Personal Loan:")
print(personal_loan_correlations)
# Set the figure size
plt.figure(figsize=(10, 6))
# Create the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix Heatmap')
plt.show()
# Identify attributes whose correlation with the target exceeds 0.1
positive_correlations = personal_loan_correlations[personal_loan_correlations > 0.1]
print("\nAttributes with correlation above 0.1 with Personal Loan:")
print(positive_correlations)
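One caveat with the factorize-based matrix above: `pd.factorize` assigns codes in order of first appearance, so the sign of a correlation can flip depending on row order. For the 0/1 flags in this dataset, casting back to `int` keeps the natural ordering; a sketch on toy data (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Personal_Loan": pd.Series([0, 1, 1, 0, 1], dtype="category"),
    "CD_Account": pd.Series([0, 1, 1, 0, 0], dtype="category"),
    "Income": [60, 150, 130, 45, 120],
})

# Cast the binary categoricals back to int so 0/1 keeps its meaning,
# then correlate every column against the target.
numeric = df.assign(
    Personal_Loan=df["Personal_Loan"].astype(int),
    CD_Account=df["CD_Account"].astype(int),
)
corr = numeric.corr()["Personal_Loan"].drop("Personal_Loan")
print(corr.round(2))
```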
Observations
Question 4 - How does a customer's interest in purchasing a loan vary with their age?
# Create age groups
bins = [20, 30, 40, 50, 60, 70, 80] # Define bin edges for age groups
labels = ['20-29', '30-39', '40-49', '50-59', '60-69', '70+']
personal_loan_df['Age_Group'] = pd.cut(personal_loan_df['Age'], bins=bins, labels=labels, right=False)
# Group by age group and compute the proportion of customers who accepted a loan
age_group_loan_interest = personal_loan_df.groupby('Age_Group')['Personal_Loan'].apply(lambda x: (x == 1).sum() / len(x))
# Create the bar chart
plt.figure(figsize=(12, 6))
plt.bar(age_group_loan_interest.index, age_group_loan_interest.values)
plt.xlabel('Age Group')
plt.ylabel('Proportion of Customers Interested in Loan')
plt.title('Interest in Loan by Age Group')
plt.grid(True)
plt.show()
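Since `Personal_Loan` is a 0/1 indicator, the per-group acceptance rate is simply the mean of the int-valued column, which is equivalent to the sum-over-length lambda above; a toy sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "Age_Group": ["20-29", "20-29", "30-39", "30-39", "30-39", "30-39"],
    "Personal_Loan": [0, 1, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the proportion of 1s in each group.
rates = df.groupby("Age_Group")["Personal_Loan"].mean()
print(rates.to_dict())  # {'20-29': 0.5, '30-39': 0.75}
```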
Observations:
Loan Interest Across Age Groups:
Breakdown of Loan Interest by Age:
Key Observations:
Question 5 - How does a customer's interest in purchasing a loan vary with their education?
# Calculate the proportion of customers who accepted a loan at each education level.
loan_interest_by_education = personal_loan_df.groupby('Education')['Personal_Loan'].apply(lambda x: (x == 1).sum() / len(x))
# Print proportion of customers interested in loans by education level.
# 1: Undergrad; 2: Graduate; 3: Advanced/Professional
print("Loan Interest by Education Level")
print("-" * 60)
for education_level, proportion in loan_interest_by_education.items():
print(f"Education Level: {education_level}, Proportion: {proportion:.2%}")
# Visualize results.
loan_interest_by_education.plot(kind='bar', color='green')
plt.xlabel('Education Level')
plt.ylabel('Proportion of Customers Interested in Loans')
plt.title('Customer Interest in Purchasing Loans by Education Level')
plt.show()
Observations:
Loan Interest Among Education Groups:
Recommendation:
Additional Exploratory Data Analysis
Univariate Analysis
The first step of univariate analysis is to check the distribution/spread of the data, primarily with histograms and box plots. Additionally, we plot each numerical feature as a violin plot and as a cumulative density distribution. The summary() function below draws these four plots for a given numerical attribute and also prints the feature-wise five-point summary.
!pip install tabulate -q --user
# Import the tabulate function from the tabulate library.
from tabulate import tabulate
def summary(x):
'''
The function prints the 5 point summary and histogram, box plot,
violin plot, and cumulative density distribution plots for each
feature name passed as the argument.
Parameters:
----------
x: str, feature name
Usage:
------------
summary('age')
'''
x_min = personal_loan_df[x].min()
x_max = personal_loan_df[x].max()
Q1 = personal_loan_df[x].quantile(0.25)
Q2 = personal_loan_df[x].quantile(0.50)
Q3 = personal_loan_df[x].quantile(0.75)
stats = {'Min': x_min, 'Q1': Q1, 'Q2': Q2, 'Q3': Q3, 'Max': x_max}
df = pd.DataFrame(stats, index=['Value'])
print(f'5 Point Summary of {x.capitalize()} Attribute:\n')
print(tabulate(df, headers = 'keys', tablefmt = 'psql'))
fig = plt.figure(figsize=(16, 8))
plt.subplots_adjust(hspace = 0.6)
sns.set_palette('Pastel1')
plt.subplot(221, frameon=True)
ax1 = sns.histplot(personal_loan_df[x], color = 'purple')
ax1.axvline(
np.mean(personal_loan_df[x]), color="purple", linestyle="--"
) # Add mean to the histogram
ax1.axvline(
np.median(personal_loan_df[x]), color="black", linestyle="-"
) # Add median to the histogram
plt.title(f'{x} Density Distribution')
plt.subplot(222, frameon=True)
ax2 = sns.violinplot(x = personal_loan_df[x], palette = 'Accent')
plt.title(f'{x.capitalize()} Violinplot')
plt.subplot(223, frameon=True, sharex=ax1)
ax3 = sns.boxplot(x=personal_loan_df[x], palette = 'cool', width=0.7, linewidth=0.6, showmeans=True)
plt.title(f'{x.capitalize()} Boxplot')
plt.subplot(224, frameon=True, sharex=ax2)
ax4 = sns.kdeplot(personal_loan_df[x], cumulative=True)
plt.title(f'{x} Cumulative Density Distribution')
plt.show()
Observation on Age
summary('Age')
Observations:
Age Distribution:
Key Age Statistics:
Observation on Experience
summary('Experience')
Observations:
Experience Distribution:
Key Experience Statistics:
Potential Correlation with Age:
Observations on Income
summary('Income')
Observations
Income Distribution:
Income Range:
High-End Outliers:
Observations on CCAvg
summary('CCAvg')
Observations
CC Avg Distribution:
Range:
High-End Outliers:
Observations on Mortgage
summary('Mortgage')
Observations
Mortgage Distribution:
Range:
High-End Outliers:
Outlier Treatment:
Percentage on Bar Chart for Categorical Features
Categorical variables are most effectively visualized as bar charts representing percentage of total, ensuring clearer insights into distribution patterns.
def perc_on_bar(cat_columns):
'''
Takes a list of categorical column names and plots a bar chart for each,
with the percentage (and count) displayed on top of every bar.
Usage:
------
perc_on_bar(['Education', 'Online'])
'''
num_cols = len(cat_columns)
# Calculate the number of rows needed
num_rows = (num_cols + 1) // 2 # Add 1 to ensure enough rows for odd numbers of columns
plt.figure(figsize=(16, 14))
for i, col in enumerate(cat_columns):
plt.subplot(num_rows, 2, i + 1) # Use calculated num_rows and 2 columns
order = personal_loan_df[col].value_counts(ascending=False).index
ax = sns.countplot(data=personal_loan_df, x=col, palette='crest', order=order)
for p in ax.patches:
# Label each bar with its percentage of all customers and its raw count
percentage = '{:.1f}%\n({})'.format(100 * p.get_height() / len(personal_loan_df), p.get_height())
x = p.get_x() + p.get_width() / 2
y = p.get_y() + p.get_height() + 40
plt.annotate(percentage, (x, y), ha='center', color='black', fontsize='medium') # Annotation on top of bars
plt.xticks(color='black', fontsize='medium')
plt.tight_layout()
plt.title(col.capitalize() + ' Percentage Bar Charts\n\n')
cat_columns = personal_loan_df.select_dtypes(exclude=[np.int64, np.float64]).columns.unique().tolist() # Non-numeric (categorical) columns
# Drop the derived bin columns so only the original categorical features are plotted
for col in ['age_bin', 'income_bin', 'cc_spending_bin', 'Age_Group', 'Family']:
if col in cat_columns:
cat_columns.remove(col)
perc_on_bar(cat_columns)
Observations
Education Level:
Financial Product Ownership:
Banking Preferences:
Personal Loan Uptake:
Key Implication:
Potential Actions:
cat_columns = ['income_bin', 'cc_spending_bin', 'age_bin']
perc_on_bar(cat_columns)
Observations
Age Distribution:
Income Levels:
Credit Card Spending Patterns:
Bivariate Analysis
Bivariate analysis focuses on identifying relationships between two variables to determine patterns, dependencies, or correlations. It plays a crucial role in understanding how one variable may influence another.
Personal Loans vs. All Numerical Columns
# Box plots of each numerical feature grouped by Personal Loan status (0: Not Borrowed, 1: Borrowed) help visualize distributions, identify trends, and detect outliers.
plt.style.use('ggplot') # Setting plot style
numeric_columns = personal_loan_df.select_dtypes(include=np.number).columns.unique().tolist() # Only numerical columns
# Check if 'ZIPCode' is in the list before attempting to remove it
if 'ZIPCode' in numeric_columns:
numeric_columns.remove('ZIPCode') # Excluding zip code, as there are too many, and it won't make sense
# Ensure 'Family' appears exactly once (it stays numeric, so it is usually in the list already)
if 'Family' not in numeric_columns:
numeric_columns.append('Family')
plt.figure(figsize=(20,30))
for i, col in enumerate(numeric_columns):
plt.subplot(8,2,i+1)
sns.boxplot(data=personal_loan_df, x='Personal_Loan', y=col, palette="Blues")
plt.xticks(ticks=[0,1], labels=['No (0)', 'Yes (1)'])
plt.tight_layout()
plt.title(str(i+1)+ ': Personal Loan vs. ' + col, color='black')
Observations
No Clear Clustering Pattern:
Income Influence:
Family Size Impact:
Mortgage Influence:
Credit Card Spending Correlation:
Personal Loan vs. Education
# This function takes a categorical column as input and visualizes percentage distributions using bar charts, pie charts and stacked charts.
def cat_view(x):
"""
Function to create a Bar chart and a Pie chart for categorical variables.
"""
from matplotlib import cm
color1 = cm.inferno(np.linspace(.4, .8, 30))
color2 = cm.viridis(np.linspace(.4, .8, 30))
sns.set_palette('cubehelix')
fig, ax = plt.subplots(1, 2, figsize=(16, 4))
"""
Draw a Pie Chart on first subplot.
"""
s = personal_loan_df.groupby(x).size()
mydata_values = s.values.tolist()
mydata_index = s.index.tolist()
def func(pct, allvals):
absolute = int(pct/100.*np.sum(allvals))
return "{:.1f}%\n({:d})".format(pct, absolute)
wedges, texts, autotexts = ax[0].pie(mydata_values, autopct=lambda pct: func(pct, mydata_values),
textprops=dict(color="w"))
ax[0].legend(wedges, mydata_index,
title=x.capitalize(),
loc="center left",
bbox_to_anchor=(1, 0, 0.5, 1))
plt.setp(autotexts, size=12)
ax[0].set_title(f'{x.capitalize()} Pie Chart')
"""
Draw a Bar Graph on second subplot.
"""
# Count customers per category level and loan status (Income serves only as a non-null column to count)
df = pd.pivot_table(personal_loan_df, index = [x], columns = ['Personal_Loan'], values = ['Income'], aggfunc = len)
labels = df.index.tolist()
loan_no = df.values[:, 0].tolist()
loan_yes = df.values[:, 1].tolist()
l = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
rects1 = ax[1].bar(l - width/2, loan_no, width, label='No Loan', color = color1)
rects2 = ax[1].bar(l + width/2, loan_yes, width, label='Loan', color = color2)
# Add some text for labels, title and custom x-axis tick labels, etc.
ax[1].set_ylabel('Scores')
ax[1].set_title(f'{x.capitalize()} Bar Graph')
ax[1].set_xticks(l)
ax[1].set_xticklabels(labels)
ax[1].legend()
def autolabel(rects):
"""Attach a text label above each bar in *rects*, displaying its height."""
for rect in rects:
height = rect.get_height()
ax[1].annotate('{}'.format(height),
xy=(rect.get_x() + rect.get_width() / 2, height),
xytext=(0, 3), # 3 points vertical offset
textcoords="offset points",
fontsize = 'medium',
ha='center', va='bottom')
autolabel(rects1)
autolabel(rects2)
fig.tight_layout()
plt.show()
"""
Draw a Stacked Bar Graph on bottom.
"""
sns.set(palette="tab10")
# Row-normalized crosstab of the feature against loan status
tab = pd.crosstab(personal_loan_df[x], personal_loan_df['Personal_Loan'].map({0:'No Loan', 1:'Loan'}), normalize="index")
tab.plot.bar(stacked=True, figsize=(16, 3))
plt.title(x.capitalize() + ' Stacked Bar Plot')
plt.legend(loc="upper right", bbox_to_anchor=(0,1))
plt.show()
cat_view('Education')
Observations
Loan Uptake by Education Level:
Possible Explanation:
Potential Action:
Personal Loan vs. Family
cat_view('Family')
Observations
Loan Uptake by Family Size:
Possible Explanation:
Potential Action:
Personal Loan vs. Securities Account
cat_view('Securities_Account')
Observations
Loan vs. Securities Account Relationship:
Possible Interpretation:
Potential Action:
Personal Loan vs. Online Banking
cat_view('Online')
Observations
Impact of Online Banking on Loan Uptake:
Possible Explanation:
Potential Action:
Personal Loan vs. CD Account
cat_view('CD_Account')
Observations
Certificate of Deposit (CD) & Loan Uptake:
Majority Behavior:
Potential Action:
Personal Loan vs. Credit Card
cat_view('CreditCard')
Observations
Loan Adoption vs. External Credit Cards:
Possible Interpretation:
Potential Action:
Personal Loans vs. Age
cat_view('age_bin')
Observations
Loan Adoption Among Middle-Aged Customers:
Possible Explanation:
Potential Action:
Personal Loan vs. Zip Code
# Bar chart of Personal Loan status (No Loan vs. Loan) across the top 10 ZIP codes
# Get the top 10 zip codes
top_10_zipcodes = personal_loan_df['ZIPCode'].value_counts().nlargest(10).index
# Filter the DataFrame to include only the top 10 zip codes
top_zip_df = personal_loan_df[personal_loan_df['ZIPCode'].isin(top_10_zipcodes)]
plt.figure(figsize=(12, 6))
sns.countplot(data=top_zip_df, x='ZIPCode', hue='Personal_Loan')
plt.title('Personal Loan vs. Top 10 Zip Codes')
plt.xlabel('Zip Code')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Personal Loan')
plt.tight_layout()
plt.show()
Personal Loan vs. Income
cat_view('income_bin')
Observations
Loan Adoption Among High-Income Customers:
Possible Explanation:
Potential Action:
Personal Loan vs. CC Average
cat_view('cc_spending_bin')
Observations
Loan Adoption Among High-Spending Customers:
Customers with high expenditures in the $2.5K–$10K range are more likely to have taken a Personal Loan, suggesting a correlation between spending behavior and loan uptake. Possible Explanation:
High-spending individuals may have greater financial needs, leading them to seek loans for liquidity or lifestyle maintenance. They might also use loans strategically to manage cash flow, handle large purchases, or consolidate debt. Potential Action:
Banks could offer personalized loan products for high-spending customers, with features like flexible repayment plans or credit-linked benefits. Investigate whether spending categories (e.g., travel, luxury goods, or business expenses) influence loan adoption, refining financial insights further.
Multivariate Analysis
Education Level vs. Income by Personal Loan
# Swarm plot of customers by Income and Education level, segregated by whether they opted for a Personal Loan
sns.set(palette='icefire')
plt.figure(figsize=(15,5))
sns.swarmplot(data=personal_loan_df, x='Education', y='Income', hue='Personal_Loan').set(title='Swarmplot: Education vs Income by Personal Loan\n0: Not Borrowed, 1: Borrowed');
plt.legend(loc="upper left", title="Opted Personal Loan", bbox_to_anchor=(1,1));
Observations
Age vs. Mortgage Value by Personal Loan
Observations
Loan Uptake Among High-Mortgage Customers:
Possible Explanation
Income vs. Mortgage Value by Personal Loan
sns.set_palette('tab10')
sns.jointplot(data=personal_loan_df, x='Income', y='Mortgage', hue='Personal_Loan');
Income vs. CCAverage by Personal Loan
sns.set_palette('tab10')
sns.jointplot(data=personal_loan_df, x='Income', y='CCAvg', hue='Personal_Loan');
Observations
Financial Profile & Loan Uptake:
Significance of These Features:
Pairplot of all available numeric columns, hued by Personal Loan
# Below plot shows correlations between the numerical features in the dataset
plt.figure(figsize=(20,20));
sns.set(palette="nipy_spectral");
sns.pairplot(data=personal_loan_df, hue='Personal_Loan', corner=True);
Heatmap to visualize and analyse correlations between independent and dependent variables
# Plotting correlation heatmap of the features
# Temporarily cast the categorical columns to int so they appear in the correlation matrix
category_columns = ['Personal_Loan', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'Education']
personal_loan_df[category_columns] = personal_loan_df[category_columns].astype('int')
# Selecting only numerical columns for correlation calculation
numerical_df = personal_loan_df.select_dtypes(include=np.number)
sns.set(rc={"figure.figsize": (15, 15)})
sns.heatmap(
numerical_df.corr(), # Calculating correlation on numerical_df
annot=True,
linewidths=0.5,
center=0,
cbar=False,
cmap="YlGnBu",
fmt="0.2f",
)
plt.show()
personal_loan_df[category_columns] = personal_loan_df[category_columns].astype('category')
Observations
Correlation Between Age & Experience
Decision to Drop Experience
Income vs. Education by Personal Loan
sns.set(palette='Accent')
#Income Vs Education Vs Personal_Loan
plt.figure(figsize=(15,5))
sns.boxplot(data=personal_loan_df,y='Income',x='Education',hue='Personal_Loan')
plt.show()
Observations
Education & Income Correlation:
Loan Adoption by Education Level:
Potential Action:
Income vs. Family Size by Personal Loan
#Income Vs Family Vs Personal_Loan
plt.figure(figsize=(15,5))
sns.boxplot(data=personal_loan_df,y='Income',x='Family',hue='Personal_Loan')
plt.show()
Observations
Higher Income Across All Family Groups:
Possible Explanation:
Potential Action:
Mortgage Value vs. Family size by Personal Loan
# Mortgage Vs Family Vs Personal_Loan
plt.figure(figsize=(15,5))
sns.boxplot(data=personal_loan_df,y='Mortgage',x='Family',hue='Personal_Loan')
plt.show()
Observations
Outliers in Family Size (1-2) for Non-Borrowers:
Mortgage vs. Family Size Relationship:
Potential Action:
CC Average vs. Credit Card by Personal Loan
#CCAvg Vs Credit Card Vs Personal_Loan
plt.figure(figsize=(15,5))
sns.boxplot(data=personal_loan_df,y='CCAvg',x='CreditCard',hue='Personal_Loan')
plt.show()
Observations
Credit Card Usage & Loan Adoption:
Outliers in Non-Borrowers:
Potential Action:
Additional Insights from the EDA
Customer Segmentation Based on Financial Behavior:
Significance of These Features:
These observations provide powerful insights into customer loan adoption patterns and key predictors of personal loan uptake. Here’s a refined breakdown:
✔ Higher Income → Increased Loan Uptake
✔ Family Size Influence
✔ Mortgage & Loan Correlation
✔ Credit Card Usage & Loan Uptake
✔ Advanced/Professional Education → More Loan Borrowers
✔ Certificate of Deposit (CD) Accounts & Loan Trends
✔ Customers from ZIP codes such as 94720, 95616, and 94305 (amongst others) opt for personal loans more frequently, suggesting regional financial behaviors or market dynamics.
✔ Middle-aged customers (35–55 years) make up the majority of loan borrowers, aligning with life stages where financial commitments peak.
📌 Income
📌 Family Size
📌 Education Level
📌 CD Account Ownership
📌 Region
Outlier Treatment for Right-Skewed Data:
Dropping Unnecessary Columns:
- Feature Removal for Optimization:
- Outlier Capping Using Whiskers:
Duplication of Dataset
# Keep a snapshot of the EDA-stage DataFrame before applying preprocessing
personal_loan_df_eda = personal_loan_df.copy()
Statistical Summary
personal_loan_df.describe(include='all').T
Dropping Unnecessary Columns
# Drop Experience (collinear with Age), ZIPCode, and the derived bin columns
personal_loan_df.drop(columns=['Experience', 'ZIPCode', 'income_bin', 'cc_spending_bin', 'age_bin'], inplace=True)
Updated Statistical Summary
personal_loan_df.describe(include='all').T
Outlier Treatment
numerical_col = personal_loan_df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20, 5))
for i, variable in enumerate(numerical_col):
    plt.subplot(1, 5, i + 1)
    plt.boxplot(personal_loan_df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Outlier Analysis in Key Financial Features:
Treating Outliers
To manage extreme values effectively, we will develop two functions following this approach:
✔ Values below the Lower Whisker will be adjusted to match the Lower Whisker threshold.
✔ Values above the Upper Whisker will be capped at the Upper Whisker limit. This caps the influence of extreme values while leaving the rest of the data intact.
def treat_outliers(personal_loan_df, col):
    '''
    Treats outliers in a numerical variable by capping at the whiskers.
    personal_loan_df: DataFrame
    col: str, name of the numerical column
    '''
    Q1 = personal_loan_df[col].quantile(0.25)  # 25th percentile
    Q3 = personal_loan_df[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # All values smaller than Lower_Whisker are raised to Lower_Whisker,
    # and all values above Upper_Whisker are capped at Upper_Whisker
    personal_loan_df[col] = np.clip(personal_loan_df[col], Lower_Whisker, Upper_Whisker)
    return personal_loan_df
def treat_outliers_all(personal_loan_df, col_list):
    '''
    Treats outliers in all given numerical variables.
    personal_loan_df: DataFrame
    col_list: list of numerical column names
    '''
    for col in col_list:
        personal_loan_df = treat_outliers(personal_loan_df, col)
    return personal_loan_df
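As a sanity check of the capping rule, the whisker computation and `np.clip` can be exercised on a small synthetic Series (illustrative values, not the bank data):

```python
import numpy as np
import pandas as pd

# Small synthetic series with one extreme value (illustrative only)
s = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 100.0])

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# np.clip caps values outside the whiskers, exactly as in treat_outliers
capped = pd.Series(np.clip(s, lower, upper))

print(upper)          # 5.75 for this series
print(capped.max())   # the extreme 100.0 is capped at the upper whisker
```

The extreme value 100.0 is pulled down to the upper whisker (5.75 here), while all in-range values are untouched.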
numerical_col = personal_loan_df.select_dtypes(include=np.number).columns.tolist()
# getting list of numerical columns
numerical_col.remove('Age')
numerical_col.remove('Family')
# treating outliers (treat_outliers_all modifies personal_loan_df in place)
personal_loan_df = treat_outliers_all(personal_loan_df, numerical_col)
Verify Outlier Treatment
numerical_col = personal_loan_df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20, 5))
for i, variable in enumerate(numerical_col):
    plt.subplot(1, 5, i + 1)
    plt.boxplot(personal_loan_df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
There are no more outliers in our dataset.
Creating our Training and Testing Data
We'll first split the dataset into independent features and the dependent (target) variable. The categorical column Education and the engineered bin columns are then one-hot encoded; the remaining categorical columns hold binary values (0 or 1) and are unaffected by encoding. Finally, we split the data into training and testing sets (30% reserved for testing).
from sklearn.model_selection import train_test_split
# Assumes the features 'X' (with the encoded categorical columns) and target 'y' are already defined
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=6)
# Print shapes to verify the split
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
print('Percentage of classes in training set:\n',y_train.value_counts(normalize=True)*100)
print('Percentage of classes in test set:\n',y_test.value_counts(normalize=True)*100)
We have split the dataset into training and testing sets. In both datasets, the target variable shows a 91:9 class distribution, where 91% of customers did not take a personal loan, while 9% did. This class imbalance will be considered when adjusting the class_weight parameter during model training to ensure better handling of the minority class.
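For reference, the weights that `class_weight` would assign to this kind of 91:9 split can be computed with scikit-learn's `compute_class_weight` (shown here on an illustrative array, not the actual split):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative target with a ~91:9 imbalance (not the actual bank data)
y_demo = np.array([0] * 91 + [1] * 9)

# 'balanced' weights = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
print(dict(zip([0, 1], weights)))
```

The minority class (loan adopters) receives a weight roughly ten times that of the majority class, which is what pushes a weighted model to pay more attention to potential adopters.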
Minimizing False Negatives in Loan Predictions
A model can make incorrect predictions in two ways:
✔ False Positive: Predicting a person will take a loan, but they actually don’t → Loss of Resources
✔ False Negative: Predicting a person won’t take a loan, but they actually do → Loss of Opportunity
Since the primary goal of the campaign is to bring in more customers, reducing False Negatives is the priority. If a potential customer is missed by the sales/marketing team, it represents a lost opportunity for conversion.
Optimizing for Recall
To minimize missed opportunities, the model should maximize Recall, ensuring it correctly identifies both classes. Higher Recall improves the chances of detecting potential loan adopters, even at the cost of slightly more False Positives.
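The effect of the classification threshold on Recall can be sketched with hypothetical probabilities (the labels and scores below are made up for illustration): lowering the threshold converts more borderline customers into predicted adopters, raising Recall at the cost of some Precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.35, 0.55, 0.7, 0.9])

results = {}
for threshold in (0.5, 0.3):
    y_pred = (y_prob > threshold).astype(int)
    results[threshold] = (recall_score(y_true, y_pred),
                          precision_score(y_true, y_pred))
    print(f"threshold={threshold}: recall={results[threshold][0]:.2f}, "
          f"precision={results[threshold][1]:.2f}")
```

At the lower threshold, Recall rises (fewer missed adopters) while Precision drops, which matches the campaign's preference for minimizing False Negatives.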
Scoring and Confusion Matrix
def get_metrics_score(model, train, test, train_y, test_y, threshold=0.5, flag=True, roc=False):
    '''
    Function to calculate different metric scores of the model - Accuracy, Recall, Precision, and F1 score
    model: classifier used to predict values of X
    train, test: independent features
    train_y, test_y: dependent variable
    threshold: threshold for classifying an observation as 1
    flag: if True, print the individual metric scores. The default value is True.
    roc: if True, also print the ROC-AUC scores. The default value is False.
    '''
    # defining an empty list to store train and test results
    score_list = []
    pred_train = (model.predict_proba(train)[:, 1] > threshold).astype(int)
    pred_test = (model.predict_proba(test)[:, 1] > threshold).astype(int)
    train_acc = accuracy_score(train_y, pred_train)
    test_acc = accuracy_score(test_y, pred_test)
    train_recall = recall_score(train_y, pred_train)
    test_recall = recall_score(test_y, pred_test)
    train_precision = precision_score(train_y, pred_train)
    test_precision = precision_score(test_y, pred_test)
    train_f1 = f1_score(train_y, pred_train)
    test_f1 = f1_score(test_y, pred_test)
    score_list.extend((train_acc, test_acc, train_recall, test_recall,
                       train_precision, test_precision, train_f1, test_f1,
                       pred_train, pred_test))
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
        print("F1 on training set : ", train_f1)
        print("F1 on test set : ", test_f1)
    if roc:
        # Use the predicted probabilities (not the hard labels) for ROC-AUC
        pred_train_prob = model.predict_proba(train)[:, 1]
        pred_test_prob = model.predict_proba(test)[:, 1]
        print("ROC-AUC Score on training set : ", roc_auc_score(train_y, pred_train_prob))
        print("ROC-AUC Score on test set : ", roc_auc_score(test_y, pred_test_prob))
    return score_list  # returning the list with train and test scores
def make_confusion_matrix(model, test_X, y_actual, i, seg, labels=[1, 0]):
    '''
    model: classifier used to predict values of X
    test_X: test set
    y_actual: ground truth
    i: index into the global `axes` array created with plt.subplots
    seg: segment label for the plot title (e.g., 'Training', 'Testing')
    '''
    y_predict = model.predict(test_X)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(cm, index=['Actual - Borrowed', 'Actual - Not Borrowed'],
                         columns=['Predicted - Borrowed', 'Predicted - Not Borrowed'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot = np.asarray(annot).reshape(2, 2)
    # Draws on axes[i]; `axes` must be created beforehand with plt.subplots
    sns.heatmap(df_cm, annot=annot, fmt='', ax=axes[i],
                cmap='Blues').set(title='Confusion Matrix of {} Set'.format(seg))
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
def add_score_model(score):
    '''Add scores to lists so that we can compare all models' scores together'''
    acc_train.append(score[0])
    acc_test.append(score[1])
    recall_train.append(score[2])
    recall_test.append(score[3])
    precision_train.append(score[4])
    precision_test.append(score[5])
    f1_train.append(score[6])
    f1_test.append(score[7])
Logistic Regression
# Create 'income_bin' and 'cc_spending_bin' in personal_loan_df
personal_loan_df["income_bin"] = pd.cut(x=personal_loan_df["Income"], bins=[0, 39, 98, 224], labels=["Low", "Mid", "High"])
personal_loan_df["cc_spending_bin"] = pd.cut(x=personal_loan_df["CCAvg"], bins=[-0.0001, 0.7, 2.5, 10.0], labels=["Low", "Mid", "High"])
personal_loan_df["age_bin"] = pd.cut(x=personal_loan_df["Age"], bins=[0, 35, 55, 67], labels=["Young Adults", "Middle Aged", "Senior"])
# Define the features 'X' and target variable 'y'
if 'Experience' in personal_loan_df.columns:
    X = personal_loan_df[['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage',
                          'Securities_Account', 'CD_Account', 'Online', 'CreditCard',
                          'income_bin', 'cc_spending_bin', 'age_bin']]
else:
    # 'Experience' may have been dropped earlier, so fall back to the remaining features
    print("Warning: 'Experience' column not found in personal_loan_df. Using remaining features.")
    X = personal_loan_df[['Age', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage',
                          'Securities_Account', 'CD_Account', 'Online', 'CreditCard',
                          'income_bin', 'cc_spending_bin', 'age_bin']]
y = personal_loan_df['Personal_Loan']
# Now perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
# Convert categorical features to numerical using one-hot encoding
X_train = pd.get_dummies(X_train,
columns=['Education', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'income_bin', 'cc_spending_bin', 'age_bin'],
drop_first=True)
X_test = pd.get_dummies(X_test,
columns=['Education', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'income_bin', 'cc_spending_bin', 'age_bin'],
drop_first=True)
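One caveat with encoding the splits separately: if a category appears in only one split, `pd.get_dummies` produces mismatched columns. A quick sketch (with hypothetical frames, not the bank data) shows how `DataFrame.align` with `fill_value=0` repairs this, the same approach used later before fitting the decision tree:

```python
import pandas as pd

# Hypothetical splits where the test set is missing one category
train = pd.DataFrame({"Education": ["Undergrad", "Graduate", "Advanced"]})
test = pd.DataFrame({"Education": ["Undergrad", "Graduate"]})

train_d = pd.get_dummies(train, columns=["Education"], drop_first=True)
test_d = pd.get_dummies(test, columns=["Education"], drop_first=True)

# Align test columns to train, filling the missing dummy columns with 0
train_d, test_d = train_d.align(test_d, join="left", axis=1, fill_value=0)
print(list(train_d.columns) == list(test_d.columns))
```

After alignment both frames share the same column set, so a model fitted on the training dummies can score the test dummies without a feature-name mismatch.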
Model Score
# Import the necessary libraries
from sklearn.linear_model import LogisticRegression

# Create and train the Logistic Regression model
model1 = LogisticRegression(random_state=1)
model1.fit(X_train, y_train)

# Evaluate the model
scores_LR = get_metrics_score(model1, X_train, X_test, y_train, y_test)
add_score_model(scores_LR)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
def make_confusion_matrix(model, test_X, y_actual, i, seg, labels=None):
    '''
    Plots a confusion matrix for a given model and data.
    Args:
        model: the trained machine learning model
        test_X: the feature data for testing
        y_actual: the actual target values for the test data
        i: index for the subplot (when creating multiple plots)
        seg: segment label (e.g., 'Training', 'Testing')
        labels: optional class-label order for the confusion matrix
    Returns:
        None (displays the confusion matrix plot).
    '''
    # Default to [1, 0] so the first row/column corresponds to 'Borrowed'
    if labels is None:
        labels = [1, 0]
    y_predict = model.predict(test_X)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(cm, index=['Actual - Borrowed', 'Actual - Not Borrowed'],
                         columns=['Predicted - Borrowed', 'Predicted - Not Borrowed'])
    # Position 1 holds the training-set matrix and position 2 the testing-set matrix
    plt.subplot(1, 2, 1 if seg == 'Training' else 2)
    sns.heatmap(df_cm, annot=True, fmt=".0f", annot_kws={"size": 12}, cmap='Blues')
    plt.title(seg + ' Confusion Matrix', color='black')
    plt.tight_layout()
# Create the confusion matrices for the trained Logistic Regression model (model1)
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
# Using subplots to create a side-by-side view
make_confusion_matrix(model1, X_train, y_train, i=0, seg='Training')
make_confusion_matrix(model1, X_test, y_test, i=1, seg='Testing')
plt.show() # Display the confusion matrices
ROC
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_auc_score, roc_curve # Import necessary functions
fig, axes = plt.subplots(1, 2, figsize=(20, 7))
# Initialize and fit a Logistic Regression model for the ROC analysis
lg1 = LogisticRegression(random_state=1)
lg1.fit(X_train, y_train)
# Now you can proceed with generating ROC curves and predictions
# ROC Curve for Training Data
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict_proba(X_train)[:, 1])
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict_proba(X_train)[:, 1])
sns.lineplot(x=fpr, y=tpr, ax=axes[0]).set(
title="Receiver operating characteristic on Train\nLogistic Regression (area = %0.2f)"
% logit_roc_auc_train
)
axes[0].plot([0, 1], [0, 1], "r--")
axes[0].set_xlim([0.0, 1.0])
axes[0].set_ylim([0.0, 1.05])
axes[0].set_xlabel("False Positive Rate")
axes[0].set_ylabel("True Positive Rate")
# ROC Curve for Test Data
logit_roc_auc_test = roc_auc_score(y_test, lg1.predict_proba(X_test)[:, 1])
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict_proba(X_test)[:, 1])
sns.lineplot(x=fpr, y=tpr, ax=axes[1]).set(
title="Receiver operating characteristic on Test\nLogistic Regression (area = %0.2f)"
% logit_roc_auc_test
)
axes[1].plot([0, 1], [0, 1], "r--")
axes[1].set_xlim([0.0, 1.0])
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlabel("False Positive Rate")
axes[1].set_ylabel("True Positive Rate")
plt.show()
The Logistic Regression model is performing well on both the training and test sets, but its recall is poor. Since the dataset has a 91:9 class imbalance, the model is biased toward the majority class (non-loan customers), making it difficult to correctly identify potential loan adopters. To address this imbalance, we will adjust the class_weight parameter to ensure better identification of minority-class customers (loan adopters) and improve recall.
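The expected effect of class_weight can be previewed on a synthetic imbalanced problem (make_classification with a roughly 91:9 split, not the bank data): balanced weights typically raise minority-class recall relative to the unweighted model. The variables below use a `_demo` suffix so they do not overwrite the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic ~91:9 imbalanced classification problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.91, 0.09], random_state=1)

plain = LogisticRegression(max_iter=1000, random_state=1).fit(X_demo, y_demo)
weighted = LogisticRegression(class_weight="balanced", max_iter=1000,
                              random_state=1).fit(X_demo, y_demo)

r_plain = recall_score(y_demo, plain.predict(X_demo))
r_weighted = recall_score(y_demo, weighted.predict(X_demo))
print(f"Recall without weights: {r_plain:.2f}, with class_weight='balanced': {r_weighted:.2f}")
```

The weighted model trades a little accuracy and precision for higher recall on the minority class, which is the trade-off we want for this campaign.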
Decision Tree
# Encode categorical variables using one-hot encoding.
X_train_encoded = pd.get_dummies(X_train)
X_test_encoded = pd.get_dummies(X_test)
# Ensure the training and test sets have the same columns after encoding.
X_train_encoded, X_test_encoded = X_train_encoded.align(X_test_encoded, join='left', axis=1, fill_value=0)
# Initialize the DecisionTreeClassifier with the Gini impurity criterion and a fixed random state.
model = DecisionTreeClassifier(
criterion="gini",
random_state=1,
# max_depth=5, # Limit the maximum depth of the tree
# min_samples_split=10, # Minimum number of samples required to split an internal node
# min_samples_leaf=5, # Minimum number of samples required to be at a leaf node
# max_leaf_nodes=20 # Maximum number of leaf nodes
)
# Fit the model on the encoded training data.
model.fit(X_train_encoded, y_train)
# Predict on the training data.
y_train_pred = model.predict(X_train_encoded)
# Calculate accuracy.
accuracy = accuracy_score(y_train, y_train_pred)
print(f'Accuracy: {accuracy:.2f}')
# Calculate precision.
precision = precision_score(y_train, y_train_pred)
print(f'Precision: {precision:.2f}')
# Calculate recall.
recall = recall_score(y_train, y_train_pred)
print(f'Recall: {recall:.2f}')
# Calculate F1 score.
f1 = f1_score(y_train, y_train_pred)
print(f'F1 Score: {f1:.2f}')
# Print classification report.
print('Classification Report:')
print(classification_report(y_train, y_train_pred))
# Calculate confusion matrix.
conf_matrix = confusion_matrix(y_train, y_train_pred)
# Plot confusion matrix.
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Accepted', 'Accepted'], yticklabels=['Not Accepted', 'Accepted'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Function for model performance evaluation.
def model_performance_classification_sklearn(model, X, y):
    # Predict the labels for the input features X using the provided model.
    y_pred = model.predict(X)
    # Calculate the accuracy of the model.
    accuracy = accuracy_score(y, y_pred)
    # Calculate the precision of the model.
    precision = precision_score(y, y_pred)
    # Calculate the recall of the model.
    recall = recall_score(y, y_pred)
    # Calculate the F1 score of the model.
    f1 = f1_score(y, y_pred)
    # Return a dictionary containing the performance metrics.
    return {'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1 Score': f1}
# Check performance on training data.
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train_encoded, y_train
)
decision_tree_perf_train
# Visualize the decision tree.
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model, # type: ignore
feature_names=list(X_train_encoded.columns), # type: ignore
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# Add arrows to the decision tree split if they are missing.
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
# Generate a text report showing the rules of the decision tree.
tree_rules = tree.export_text(model, feature_names=list(X_train_encoded.columns), show_weights=True) # type: ignore
print(tree_rules)
# Assign the feature importances to a variable.
importances = model.feature_importances_
# Sort indices in descending order.
indices = np.argsort(importances)[::-1]
# Create a DataFrame for feature importances.
feature_importances_df = pd.DataFrame({
'Feature': [list(X_train_encoded.columns)[i] for i in indices], # type: ignore
'Importance': importances[indices]
})
# Print the DataFrame.
print(feature_importances_df)
# Check model performance on the test data.
# Predict on the test data.
y_test_pred = model.predict(X_test_encoded) # type: ignore
# Calculate accuracy.
accuracy = accuracy_score(y_test, y_test_pred)
print(f'Accuracy: {accuracy:.2f}')
# Calculate precision.
precision = precision_score(y_test, y_test_pred)
print(f'Precision: {precision:.2f}')
# Calculate recall.
recall = recall_score(y_test, y_test_pred)
print(f'Recall: {recall:.2f}')
# Calculate F1 score.
f1 = f1_score(y_test, y_test_pred)
print(f'F1 Score: {f1:.2f}')
# Print classification report.
print('Classification Report:')
print(classification_report(y_test, y_test_pred))
# Calculate confusion matrix
conf_matrix = confusion_matrix(y_test, y_test_pred)
# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Accepted', 'Accepted'], yticklabels=['Not Accepted', 'Accepted'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Check performance on the test data.
decision_tree_perf_test = model_performance_classification_sklearn(model, X_test_encoded, y_test) # type: ignore
decision_tree_perf_test
Pre-Pruning
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Hyperparameter grid.
parameters = {
"max_depth": np.arange(6, 15),
"min_samples_leaf": [1, 2, 5, 7, 10],
"max_leaf_nodes": [2, 3, 5, 10],
}
# Use recall as the scoring metric to compare parameter combinations.
scorer = make_scorer(recall_score)
# Run the grid search.
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train_encoded, y_train) # type: ignore
# Set the clf to the best combination of parameters.
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train_encoded, y_train) # type: ignore
# Check the tuned model's performance on the test data.
# Predict on the test data with the pre-pruned estimator.
y_test_pred = estimator.predict(X_test_encoded)
# Calculate accuracy.
accuracy = accuracy_score(y_test, y_test_pred)
print(f'Accuracy: {accuracy:.2f}')
# Calculate precision.
precision = precision_score(y_test, y_test_pred)
print(f'Precision: {precision:.2f}')
# Calculate recall.
recall = recall_score(y_test, y_test_pred)
print(f'Recall: {recall:.2f}')
# Calculate F1 score.
f1 = f1_score(y_test, y_test_pred)
print(f'F1 Score: {f1:.2f}')
# Print classification report.
print('Classification Report:')
print(classification_report(y_test, y_test_pred))
# Calculate confusion matrix.
conf_matrix = confusion_matrix(y_test, y_test_pred)
# Plot confusion matrix.
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Accepted', 'Accepted'], yticklabels=['Not Accepted', 'Accepted'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Check the tuned model's performance on the training data.
decision_tree_tune_perf_train = model_performance_classification_sklearn(estimator, X_train_encoded, y_train) # type: ignore
decision_tree_tune_perf_train
# Visualize the decision tree.
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
estimator,
feature_names=list(X_train_encoded.columns), # type: ignore
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# Add arrows to the decision tree split if they are missing.
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Predict on the training data.
y_train_pred = estimator.predict(X_train_encoded) # type: ignore
# Calculate accuracy.
accuracy = accuracy_score(y_train, y_train_pred)
print(f'Accuracy: {accuracy:.2f}')
# Calculate precision.
precision = precision_score(y_train, y_train_pred)
print(f'Precision: {precision:.2f}')
# Calculate recall.
recall = recall_score(y_train, y_train_pred)
print(f'Recall: {recall:.2f}')
# Calculate F1 score.
f1 = f1_score(y_train, y_train_pred)
print(f'F1 Score: {f1:.2f}')
# Print classification report.
print('Classification Report:')
print(classification_report(y_train, y_train_pred))
# Calculate confusion matrix.
conf_matrix = confusion_matrix(y_train, y_train_pred)
# Plot confusion matrix.
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Accepted', 'Accepted'], yticklabels=['Not Accepted', 'Accepted'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Generate a text report showing the rules of the decision tree.
tree_rules = tree.export_text(estimator, feature_names=list(X_train_encoded.columns), show_weights=True) # type: ignore
print(tree_rules)
# Check the pre-pruned tree's performance on the test data.
decision_tree_tune_perf_test = model_performance_classification_sklearn(estimator, X_test_encoded, y_test)
decision_tree_tune_perf_test
# Compute the pruning path for the decision tree using minimal cost-complexity pruning.
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train_encoded, y_train) # type: ignore
ccp_alphas, impurities = path.ccp_alphas, path.impurities
# Create a DataFrame and sort by ccp_alphas in descending order.
pruning_path_df = pd.DataFrame(path)
pruning_path_df_sorted = pruning_path_df.sort_values(by='ccp_alphas', ascending=False)
# Show sorted pruning path values.
pruning_path_df_sorted
# Create a figure and an axis object with a specified size.
fig, ax = plt.subplots(figsize=(10, 5))
# Plot the relationship between effective alpha and total impurity of leaves
# Use markers "o" and draw style "steps-post" for the plot.
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
# Set the label for the x-axis.
ax.set_xlabel("effective alpha")
# Set the label for the y-axis.
ax.set_ylabel("total impurity of leaves")
# Set the title of the plot.
ax.set_title("Total Impurity vs effective alpha for training set")
# Display the plot.
plt.show()
# Initialize an empty list to store the decision tree classifiers.
clfs = []
# Iterate over the list of ccp_alpha values.
for ccp_alpha in ccp_alphas:
    # Initialize a DecisionTreeClassifier with the current ccp_alpha value
    # and a fixed random state.
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    # Fit the decision tree classifier on the training data.
    clf.fit(X_train_encoded, y_train)
    # Append the fitted classifier to the list.
    clfs.append(clf)
# Print the number of nodes in the last tree and the corresponding ccp_alpha value.
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
# Remove the last element from the list of classifiers and ccp_alphas.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
# Calculate the number of nodes for each classifier.
node_counts = [clf.tree_.node_count for clf in clfs]
# Calculate the depth of each classifier.
depth = [clf.tree_.max_depth for clf in clfs]
# Create a figure with two subplots, arranged vertically, with a specified size.
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
# Plot the number of nodes vs alpha on the first subplot.
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
# Plot the depth of the tree vs alpha on the second subplot.
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
# Adjust the layout to prevent overlap.
fig.tight_layout()
# Initialize an empty list to store recall scores for the training set.
recall_train = []
# Iterate over the list of classifiers.
for clf in clfs:
    # Predict the labels for the training set using the current classifier.
    pred_train = clf.predict(X_train_encoded)
    # Calculate the recall score for the training set.
    values_train = recall_score(y_train, pred_train)
    # Append the recall score to the recall_train list.
    recall_train.append(values_train)
# Initialize an empty list to store recall scores for the test set.
recall_test = []
# Iterate over the list of classifiers.
for clf in clfs:
    # Predict the labels for the test set using the current classifier.
    pred_test = clf.predict(X_test_encoded)
    # Calculate the recall score for the test set.
    values_test = recall_score(y_test, pred_test)
    # Append the recall score to the recall_test list.
    recall_test.append(values_test)
# Create a figure and an axis object with a specified size.
fig, ax = plt.subplots(figsize=(15, 5))
# Set the label for the x-axis.
ax.set_xlabel("alpha")
# Set the label for the y-axis.
ax.set_ylabel("Recall")
# Set the title of the plot.
ax.set_title("Recall vs alpha for training and testing sets")
# Plot the recall scores for the training set vs alpha.
# Use markers "o" and draw style "steps-post" for the plot.
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
# Plot the recall scores for the test set vs alpha.
# Use markers "o" and draw style "steps-post" for the plot.
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
# Add a legend to the plot.
ax.legend()
# Display the plot.
plt.show()
# Find the index of the classifier with the highest recall score on the test set.
index_best_model = np.argmax(recall_test)
# Select the classifier corresponding to the best recall score.
best_model = clfs[index_best_model]
# Print the details of the best model.
print(best_model)
Post-Pruning
# Initialize the DecisionTreeClassifier with the last ccp_alpha value from the pruning path,
# set class weights to handle class imbalance, and set a random state for reproducibility.
estimator_2 = DecisionTreeClassifier(
ccp_alpha=ccp_alphas[-1], # Use the last ccp_alpha value.
class_weight={0: 0.15, 1: 0.85}, # Set class weights.
random_state=1 # Set random state for reproducibility.
)
# Fit the classifier on the training data.
estimator_2.fit(X_train_encoded, y_train) # type: ignore
# Visualize the decision tree.
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator_2,
feature_names=list(X_train_encoded.columns), # type: ignore
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# Add arrows to the decision tree split if they are missing.
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
# Generate a text report showing the rules of the decision tree.
tree_rules = tree.export_text(estimator_2, feature_names=list(X_train_encoded.columns),show_weights=True) # type: ignore
print(tree_rules)
# Check the post-pruned tree's performance on the test data.
decision_tree_tune_post_test = model_performance_classification_sklearn(estimator_2, X_test_encoded, y_test)
decision_tree_tune_post_test
# Assign the feature importances to a variable.
importances = estimator_2.feature_importances_
# Sort indices in descending order.
indices = np.argsort(importances)[::-1]
# Create a DataFrame for feature importances.
feature_importances_df = pd.DataFrame({
    'Feature': [list(X_train_encoded.columns)[i] for i in indices],  # type: ignore
    'Importance': importances[indices]
})
# Print the DataFrame.
print(feature_importances_df)
# Check performance on the training data.
decision_tree_tune_post_train = model_performance_classification_sklearn(estimator_2, X_train_encoded, y_train) # type: ignore
decision_tree_tune_post_train
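`model_performance_classification_sklearn` is defined earlier in the notebook; its body is not shown in this section. Since the comparison cells below call `pd.DataFrame.from_dict` on its return value, a compatible stand-in would return a metrics dict. The exact metric set below is an assumption, sketched for reference:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

def model_performance_classification_sklearn(model, predictors, target):
    """Sketch of the notebook's helper: a dict of standard classification metrics."""
    pred = model.predict(predictors)
    return {
        "Accuracy": accuracy_score(target, pred),
        "Recall": recall_score(target, pred),
        "Precision": precision_score(target, pred),
        "F1": f1_score(target, pred),
    }

# Tiny usage example on perfectly separable demo data.
X_demo = np.array([[0.0], [1.0], [0.0], [1.0]])
y_demo = np.array([0, 1, 0, 1])
demo_model = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)
print(model_performance_classification_sklearn(demo_model, X_demo, y_demo))
```

Returning a plain dict keeps the helper composable: `pd.DataFrame.from_dict(scores, orient='index')` turns it into the one-column frames concatenated in the comparison cells.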
# Training data performance comparison.
# Convert dictionaries to DataFrames
decision_tree_perf_train_df = pd.DataFrame.from_dict(decision_tree_perf_train, orient='index') # type: ignore
decision_tree_tune_perf_train_df = pd.DataFrame.from_dict(decision_tree_tune_perf_train, orient='index')
# Concatenate the performance DataFrames along the columns
models_train_comp_df = pd.concat(
    [decision_tree_perf_train_df, decision_tree_tune_perf_train_df], axis=1
)
# Set the column names for the concatenated DataFrame
models_train_comp_df.columns = ["Decision Tree sklearn", "Decision Tree (Pre-Pruning)"]
# Print the training performance comparison
print("Training performance comparison:")
models_train_comp_df
# Test data performance comparison.
# Evaluate the post-pruned tree on the test set and convert dictionaries to DataFrames
decision_tree_perf_test_df = pd.DataFrame.from_dict(decision_tree_perf_test, orient='index')
decision_tree_tune_post_test = model_performance_classification_sklearn(estimator_2, X_test_encoded, y_test)  # type: ignore
decision_tree_tune_post_test_df = pd.DataFrame.from_dict(decision_tree_tune_post_test, orient='index')
# Concatenate the performance DataFrames along the columns
models_test_comp_df = pd.concat(
    [decision_tree_perf_test_df, decision_tree_tune_post_test_df], axis=1
)
# Set the column names for the concatenated DataFrame
models_test_comp_df.columns = ["Decision Tree sklearn", "Decision Tree (Post-Pruning)"]
# Print the test performance comparison
print("Test performance comparison:")
models_test_comp_df
✔ Gather additional insights into loan rejections (e.g., customers defaulting with other banks).
✔ Collect customer satisfaction data across Online Banking, CD & Securities accounts, and Credit Card services to assess influence on loan adoption.
✔ Analyze payment history to identify missed payments, helping tailor loan offers based on financial stability.
✔ Track customer loyalty duration to measure long-term banking relationships and retention strategies.
✔ Assess whether additional features—like customer demographics—could enhance marketing campaigns and targeted services.
✔ Review ZIP code data for location-based loan strategies (e.g., higher loan amounts for high-income regions).
✔ Identify lower-income customers and explore offering smaller loan amounts with reduced rates to improve accessibility.
✔ Introduce a loyalty program with competitive interest rates and reduced service fees to incentivize long-term customers.
✔ Utilize the model to automate parts of the personal loan approval process, reducing manual workload and improving efficiency.
✔ Implement stricter credit checks or customized loan amounts to mitigate risk—leveraging data from income and credit card usage.
✔ Establish continuous monitoring and periodic model updates to keep predictions aligned with evolving financial trends.